Abstract

Predictive recursion (PR) is a fast stochastic algorithm for nonparametric estimation
of mixing distributions in mixture models. It is known that the PR estimates of both
the mixing and mixture densities are consistent under fairly mild conditions, but
currently very little is known about the rate of convergence. Here I first investigate
asymptotic convergence properties of the PR estimate under model misspecification in
the special case of finite mixtures with known support. Tools from stochastic
approximation theory are used to prove that the PR estimates converge, to the best
Kullback-Leibler approximation, at a nearly root-n rate. When the support is unknown,
PR can be used to construct an objective function which, when optimized, yields an
estimate of the support. I apply the known-support results to derive a rate of
convergence for this modified PR estimate in the unknown support case, which compares
favorably to known optimal rates.

Keywords and phrases: Density estimation; Kullback-Leibler divergence; Lyapunov
function; mixture model; stochastic approximation.
1 Introduction

Nonparametric estimation of mixing distributions is an important and challenging
problem in statistics. Recent progress along these lines has been made with the fast
stochastic predictive recursion (PR) algorithm due to Newton et al. (1998) and Newton
(2002). PR is fundamentally different from existing algorithms, such as EM, in a number
of ways. Most importantly, PR is not a hill-climbing algorithm. Instead, it learns
sequentially like stochastic approximation (Kushner and Yin 2003; Robbins and Monro
1951). In addition, PR is able to estimate a mixing density with respect to any
user-defined dominating measure. That is, unlike the nonparametric maximum likelihood
estimate, which is almost surely discrete (Lindsay 1995), the PR estimate can be
discrete, continuous, or both, depending on the user's choice of dominating measure.

Theoretically, it has been shown that the PR estimates of both the mixing and mixture
densities are consistent under certain conditions; see Section 2 for more details. The
goal of this note is to investigate the rate of convergence, about which very little is
known. For this, we shall explore further the connection between PR and stochastic
approximation developed in Martin and Ghosh (2008). To the author's knowledge, results
on the rate of convergence for general stochastic approximations are only fully
developed in the finite-dimensional context. Therefore, we shall confine ourselves here
to an analysis of PR when the possibly misspecified model assumes that the
data-generating distribution is a finite mixture with known support. In this case, we
prove that the PR estimate of the mixing distribution converges almost surely at a
nearly parametric root-n rate, where the limit is characterized by the mixture model
closest to the true data-generating distribution based on the Kullback-Leibler
divergence. This result also sheds light on how one should choose PR's tuning parameter
in practical applications.

The PR algorithm itself is not naturally suited for the case when the support of the
finite mixture model is unknown. But, by applying the general principle in Martin and
Tokdar (2011b), I show that PR yields a sort of objective function which can be
optimized to estimate the unknown support. I apply the paper's known-support results to
establish rates of convergence for this new PR-based unknown-support procedure. Two
numerical examples are given to illustrate the method; for more examples and the full
computational details, the reader is referred to Martin (2011).



2 Predictive recursion 



Suppose independent data $Y_1, \ldots, Y_n$ are available from a distribution with
unknown density $m(y)$, which we model as a nonparametric mixture:

$$m_f(y) = \int_{\mathscr{U}} p(y \mid u) f(u) \, d\mu(u), \quad y \in \mathscr{Y}, \tag{1}$$

where $(y, u) \mapsto p(y \mid u)$ is a known kernel on $\mathscr{Y} \times \mathscr{U}$
and $f \in \mathbb{F}$ is unknown and to be estimated. Here
$\mathbb{F} = \mathbb{F}(\mathscr{U}, \mu)$ is the set of all densities with respect to
a given $\sigma$-finite Borel measure $\mu$ on $\mathscr{U}$. Newton (2002) presents the
following algorithm for nonparametric estimation of $f$ and $m_f$ based on
$Y_1, \ldots, Y_n$.

PR algorithm. Choose a density $f_0 \in \mathbb{F}$ and a sequence of weights
$\{w_i : i \geq 1\} \subset (0, 1)$. Then, for $i = 1, \ldots, n$, compute
$m_{i-1}(y) = m_{f_{i-1}}(y)$ and

$$f_i(u) = (1 - w_i) f_{i-1}(u) + w_i \, p(Y_i \mid u) f_{i-1}(u) / m_{i-1}(Y_i). \tag{2}$$

Return $f_n(u)$ and $m_n(y) = m_{f_n}(y)$ as estimates of $f(u)$ and $m_f(y)$,
respectively.
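The recursion is short enough to sketch directly. The following is a minimal illustration for the finite-support case with counting measure, not the implementation used in the cited work; the Gaussian kernel, the simulated data, the three-point support, and the names `normal_pdf` and `predictive_recursion` are all assumptions made for the example.

```python
import math
import random


def normal_pdf(y, u, sd=1.0):
    """Gaussian kernel p(y | u) with mean u and standard deviation sd."""
    z = (y - u) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def predictive_recursion(data, support, kernel, gamma=0.67, f0=None):
    """Run one pass of PR over `data` for a mixing distribution on `support`.

    Uses weights w_i = (i + 1) ** (-gamma) and returns f_n as a list
    aligned with `support`.
    """
    s = len(support)
    f = list(f0) if f0 is not None else [1.0 / s] * s   # uniform initial guess f_0
    for i, y in enumerate(data, start=1):
        w = (i + 1) ** (-gamma)
        lik = [kernel(y, u) for u in support]             # p(Y_i | u)
        m_prev = sum(l * fu for l, fu in zip(lik, f))      # m_{i-1}(Y_i)
        # Update (2): mix f_{i-1} with its one-observation posterior.
        f = [(1.0 - w) * fu + w * l * fu / m_prev for l, fu in zip(lik, f)]
    return f


# Toy data: equal mixture of N(-2, 1) and N(2, 1); the support is hypothetical.
rng = random.Random(0)
data = [rng.gauss(rng.choice([-2.0, 2.0]), 1.0) for _ in range(2000)]
support = [-2.0, 0.0, 2.0]
f_n = predictive_recursion(data, support, normal_pdf)
```

Each update is a convex combination of two probability vectors, so the iterates stay on the simplex without any projection step.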



PR has some interesting connections to the nonparametric Bayes estimate in the case
where the unknown mixing distribution is modeled as a random draw from the Dirichlet
process distribution. Martin and Tokdar (2011b) take advantage of this connection to
motivate a PR-based semiparametric mixture model analysis where an additional unknown
structural parameter is estimated by maximizing a PR-induced approximate marginal
likelihood. Martin and Tokdar (2011a) use this general strategy to develop a PR-based
methodology for large-scale nonparametric empirical Bayes multiple testing. In
Section 4 I apply this method to mixtures with unknown support.
Asymptotic convergence properties of the PR estimates $f_n$ and $m_n$ have only recently
become available. Let $\mathbb{M}$ denote the set of mixture densities $m_f$ as $f$
ranges over $\mathbb{F}$. Tokdar et al. (2009) build on the work of Ghosh and Tokdar
(2006) to show that when the mixture model is correctly specified (i.e.,
$m \in \mathbb{M}$), then both $f_n$ and $m_n$ converge almost surely to $f$ and $m_f$
in their respective topologies. Martin and Tokdar (2009) go one step further, showing
that if $m \notin \mathbb{M}$, then $m_n$ converges to the closest mixture density
$m_{f^*} \in \mathbb{M}$ as measured by the Kullback-Leibler divergence. As a corollary,
if $f$ is identifiable in the postulated mixture model, then $f_n$ converges almost
surely to $f^*$ in the weak topology. They also establish a bound on the rate of
convergence for $m_n$ in terms of the PR weight sequence $\{w_n\}$. For weights of the
form $w_i = (i + 1)^{-\gamma}$, for suitable $\gamma$, Martin and Tokdar (2009) obtain
an $n^{-1/6}$ bound on the Hellinger convergence rate of $m_n$ to $m_{f^*}$ for a wide
class of kernels $p(y \mid u)$. While this rate is comparable to the rate obtained in
Genovese and Wasserman (2000), it leaves a lot to be desired. In fact, simulations in
Martin and Tokdar (2009) suggest that the upper bound corresponds to a "worst case
scenario" rate of convergence, i.e., when $f^*$ sits on the boundary of $\mathbb{F}$. I
expect that a nearly parametric root-n rate for $m_n$, like that obtained by Ghosal and
van der Vaart (2001), can be achieved by PR, at least in some cases. In Section 3 we
show that this conjecture holds in the special known finite support case.



3 Asymptotics for PR with known support 

Assume that the true density $m$ is modeled as a finite mixture. That is, $\mathscr{U}$
is a finite set of size $s$ and $\mu$ is counting measure. In this case, $\mathbb{F}$
denotes the $(s - 1)$-dimensional probability simplex, and I write
$f = \{f(u) : u \in \mathscr{U}\}$. Then $m_f(y) = \sum_{u \in \mathscr{U}} p(y \mid u) f(u)$.
Throughout, all $s$-dimensional vectors $x$ will be indexed by $\mathscr{U}$, i.e.,
$x = \{x(u) : u \in \mathscr{U}\}$. Also, $\langle \cdot, \cdot \rangle$ denotes the
usual inner product and $\| \cdot \|$ the corresponding norm.
We begin by listing two basic assumptions about the mixture model.

Assumption 1. $u \mapsto p(y \mid u)$ is continuous for each $y \in \mathscr{Y}$.

Assumption 2. $f$ is identifiable in model (1), i.e., $f \mapsto m_f$ is one-to-one.

For any density $m'$ on $\mathscr{Y}$ define the Kullback-Leibler divergence of $m'$
from $m$ as $K(m, m') = \int \log\{m(y)/m'(y)\} \, m(y) \, dy$. Henceforth, I shall
silently assume that $K(m, m') < \infty$ for all $m' \in \mathbb{M}$. Then the infimum

$$K^* = \inf\{K(m, m_f) : f \in \mathbb{F}\}$$

is finite. It follows from Assumption 1 that there exists an $f^*$ in the closure of
$\mathbb{F}$ such that $K(m, m_{f^*}) = K^*$; see Lemma 3.1 of Martin and Tokdar (2009).
Assumption 2 ensures that $f^*$ is unique. Allowing the model to be misspecified is
particularly important here, given that the assumption of known finite support is
rather strong. For example, even if the support is unknown, the results that follow
show that PR does as well asymptotically as could be hoped for if we simply guess at
what $\mathscr{U}$ should be.



Following Martin and Ghosh (2008), express the PR update $f_{n-1} \mapsto f_n$,
$n \geq 1$, as follows:

$$f_n(u) = f_{n-1}(u) + w_n \Phi(Y_n, f_{n-1})(u), \quad u \in \mathscr{U}, \tag{3}$$

where, for generic $y \in \mathscr{Y}$ and $f \in \mathbb{F}$, the mapping $\Phi(y, f)$
is defined as

$$\Phi(y, f)(u) = f(u) \Bigl\{ \frac{p(y \mid u)}{m_f(y)} - 1 \Bigr\}.$$

Equation (3) shows that PR is a special case of a general Robbins-Monro type of
stochastic approximation algorithm designed to find roots of the mapping

$$\varphi(f)(u) = f(u) \Bigl\{ \int \frac{p(y \mid u)}{m_f(y)} \, m(y) \, dy - 1 \Bigr\}, \quad f \in \mathbb{F}, \; u \in \mathscr{U}. \tag{4}$$

This $\varphi(f)$ is nothing but the conditional expectation of $\Phi(Y_n, f_{n-1})$,
under the true density $m$, given $f_{n-1}$ equals $f$. The following result is an
immediate consequence of the definitions and construction above.
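As a short check on the definitions, the worked equations below rearrange the PR update (2) into the stochastic approximation form (3) and verify that the components of the mean field in (4) sum to zero. Only quantities already defined in the text are used; the second identity relies on $\sum_u p(y \mid u) f(u) = m_f(y)$ and on $m$ and $f$ each integrating or summing to one.

```latex
\begin{align*}
f_n(u)
  &= (1 - w_n) f_{n-1}(u) + w_n \frac{p(Y_n \mid u) f_{n-1}(u)}{m_{f_{n-1}}(Y_n)}
   = f_{n-1}(u) + w_n f_{n-1}(u)
     \Bigl\{ \frac{p(Y_n \mid u)}{m_{f_{n-1}}(Y_n)} - 1 \Bigr\}, \\
\sum_{u} \varphi(f)(u)
  &= \int \frac{\sum_{u} p(y \mid u) f(u)}{m_f(y)} \, m(y) \, dy - \sum_{u} f(u)
   = 1 - 1 = 0 .
\end{align*}
```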

Lemma 1. The sequence $Z_n(u)$, for $u \in \mathscr{U}$, given by

$$Z_n(u) = \Phi(Y_n, f_{n-1})(u) - \varphi(f_{n-1})(u), \tag{5}$$

is a martingale difference sequence with respect to the $\sigma$-algebra $\mathscr{A}_n$
generated by $Y_1, \ldots, Y_n$. Moreover, $\|Z_n\|^2$ is bounded for all $n \geq 1$.



According to stochastic approximation theory (e.g., Kushner and Yin 2003), convergence
properties of $f_n$, as $n \to \infty$, can be found by investigating the asymptotic
behavior of solutions of an appropriate ordinary differential equation (ODE).
Specifically, let $\{f^t : t \geq 0\}$ denote a generic trajectory in $\mathbb{F}$. Then
the limiting behavior of solutions $f^t$ of the ODE $df^t/dt = \varphi(f^t)$, as
$t \to \infty$, can be used to study the limiting behavior of the PR sequence $\{f_n\}$
as $n \to \infty$. For this purpose, I will need some basic definitions and results from
the theory of ODEs.

Lemma 2. The mixing distribution $f^*$ is an equilibrium point of the ODE
$df^t/dt = \varphi(f^t)$; in other words, $\varphi(f^*)(u) = 0$ for all $u$.

Proof. Plugging $f^*$ into the expression in (4) gives

$$\varphi(f^*)(u) = f^*(u) \Bigl\{ \int \frac{p(y \mid u)}{m_{f^*}(y)} \, m(y) \, dy - 1 \Bigr\}.$$

By the fact that $f^*$ minimizes $K(m, m_f)$, it follows from Lemma 3.3 of Martin and
Tokdar (2009) that $\varphi(f^*)(u) \leq 0$ for each $u$. But since
$\sum_u \varphi(f^*)(u)$ vanishes, it must be that $\varphi(f^*)(u) = 0$ for each $u$,
proving the claim. □

The goal is to show that $f^*$ is a stable equilibrium in the sense that any solution to
the ODE converges to $f^*$, regardless of the initial condition. For this, a Lyapunov
function will be useful.

Definition 1. A function $\ell : \mathbb{F} \to \mathbb{R}$ is a Lyapunov function at
$f^*$ for the ODE $df^t/dt = \varphi(f^t)$ if (i) $\ell(f)$ is continuously
differentiable in a neighborhood of $f^*$, (ii) $\ell(f) \geq 0$ with equality if and
only if $f = f^*$, and (iii) $\dot\ell(f) = \langle \nabla\ell(f), \varphi(f) \rangle \leq 0$.
Lyapunov's theory, described beautifully in LaSalle and Lefschetz (1961), states that
if a Lyapunov function $\ell(f)$ exists at $f = f^*$, then $f^*$ is a stable equilibrium
point. Next I show that a slight variation of the Kullback-Leibler divergence is a
Lyapunov function in the present context.

Lemma 3. The mapping $\ell : \mathbb{F} \to [0, \infty)$ given by

$$\ell(f) = K(m, m_f) - K^* + \sum_u f(u) - 1 \tag{6}$$

is a Lyapunov function for the ODE $df^t/dt = \varphi(f^t)$.

Proof. Properties (i) and (ii) in Definition 1 are obvious. For property (iii), simple
calculus reveals that $\varphi(f)(u) = -f(u)\{\nabla\ell(f)\}(u)$, from which it follows
that $\dot\ell(f) = -\sum_u f(u)\{\nabla\ell(f)\}(u)^2 \leq 0$. That equality is obtained
if and only if $f = f^*$ follows from the fact that $f^*$ is the unique minimizer of
$K(m, m_f)$ and, hence, the only point at which $\nabla\ell(f)$ vanishes. □

The function $\ell(f)$ in (6) can be viewed as a Lagrange multiplier version of the
Kullback-Leibler divergence with the trivial constraint $\sum_u f(u) = 1$. This is
consistent with the interpretation of PR as an algorithm that asymptotically minimizes
$K(m, m_f)$ over $\mathbb{F}$ (Martin and Tokdar 2009). Another important observation,
used in Lemma 5 below, is that $\ell(f)$ is convex.
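The "simple calculus" behind property (iii) can be written out as follows. This is a sketch using only the definitions of $\ell$ in (6) and $\varphi$ in (4); differentiation under the integral sign is assumed to be valid, and the terms of $\ell$ that do not depend on $f$ are dropped before differentiating.

```latex
\begin{align*}
\{\nabla \ell(f)\}(u)
  &= \frac{\partial}{\partial f(u)}
     \Bigl[ -\int \log m_f(y) \, m(y) \, dy + \sum_{v} f(v) \Bigr]
   = 1 - \int \frac{p(y \mid u)}{m_f(y)} \, m(y) \, dy, \\
-f(u) \{\nabla \ell(f)\}(u)
  &= f(u) \Bigl\{ \int \frac{p(y \mid u)}{m_f(y)} \, m(y) \, dy - 1 \Bigr\}
   = \varphi(f)(u), \\
\dot\ell(f) = \langle \nabla \ell(f), \varphi(f) \rangle
  &= -\sum_{u} f(u) \, \{\nabla \ell(f)\}(u)^2 \le 0 .
\end{align*}
```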



Next I state an extension of the PR convergence theorem in Martin and Ghosh (2008) for
the case where the true data-generating density $m$ need not belong to the class
$\mathbb{M}$ of mixture models (1). For this we need

Assumption 3. $\sum_n w_n = \infty$ and $\sum_n w_n^{1+\varepsilon} < \infty$ for some
$\varepsilon \in (0, 1]$.

In practice, it is common to take $w_n = (n + 1)^{-\gamma}$ for $\gamma \in (1/2, 1]$.
Then Assumption 3 holds with $\varepsilon > \gamma^{-1} - 1$.

Theorem 1. Under Assumptions 1-3, $f_n \to f^*$ almost surely, where $f^*$ is the unique
minimizer of $K(m, m_f)$ over $\mathbb{F}$.

Proof. In light of Lemmas 1-3, the claim follows from Theorem 5.2.3 of Kushner and Yin
(2003) and the continuity of $\varphi(f)$; see Martin and Ghosh (2008). □
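The threshold on $\varepsilon$ comes from the usual $p$-series test. The worked check below assumes Assumption 3 reads $\sum_n w_n = \infty$ and $\sum_n w_n^{1+\varepsilon} < \infty$, which is the form used in the proof of Lemma 4.

```latex
\[
\sum_{n \ge 1} (n+1)^{-\gamma} = \infty \iff \gamma \le 1,
\qquad
\sum_{n \ge 1} (n+1)^{-\gamma (1 + \varepsilon)} < \infty
\iff \gamma (1 + \varepsilon) > 1
\iff \varepsilon > \gamma^{-1} - 1 .
\]
```

For instance, $\gamma = 0.8$ requires $\varepsilon > 0.25$, while $\gamma$ just above $1/2$ forces $\varepsilon$ close to 1.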



The main result on a rate of convergence for PR will make use of a general theorem on
convergence rates of stochastic approximation (Chen 2002, Theorem 3.1.1); see
Appendix A. But two preliminary results are needed first.

Lemma 4. The sequence $Z_n$ in (5) satisfies $\sum_{n=1}^{\infty} w_n^{1-\delta} Z_n < \infty$
almost surely for $\delta \in (0, (1 - \varepsilon)/2]$, where $\varepsilon$ is as in
Assumption 3.

Proof. Let $X_N = \sum_{n=1}^{N} w_n^{1-\delta} Z_n$. By Lemma 1, $\{X_N : N \geq 1\}$ is
a martingale sequence and, since $\{Z_n\}$ is bounded,

$$E\|X_N\|^2 = \sum_{n=1}^{N} w_n^{2(1-\delta)} E\|Z_n\|^2 \leq \text{const} \sum_{n=1}^{\infty} w_n^{2(1-\delta)}.$$

Taking $\delta \leq (1 - \varepsilon)/2$, it follows from Assumption 3 that $E\|X_N\|^2$
is uniformly bounded in $N$. Then the martingale convergence theorem (Breiman 1992,
Theorem 5.14) implies that $X_N$ converges almost surely, completing the proof. □
An additional assumption about the weights is required. For weights given by
$w_n = (n + 1)^{-\gamma}$, this assumption holds as long as $\gamma < 1$.

Assumption 4. $\{w_n\}$ satisfies $w_{n+1}^{-1} - w_n^{-1} \to 0$.

Lemma 5. Let $J = D\varphi(f^*)$ denote the derivative of $\varphi$ evaluated at
$f = f^*$. If $f^*$ is in the interior of $\mathbb{F}$, then all eigenvalues of $J$ are
negative.

Proof. Simple calculus reveals that $J = D\varphi(f^*)$ is of the form

$$J(u, v) = -f^*(u) \int \frac{p(y \mid u) \, p(y \mid v)}{m_{f^*}(y)^2} \, m(y) \, dy, \quad u, v \in \mathscr{U}.$$

In matrix notation, write $J = -\mathrm{diag}(f^*) \cdot \nabla^2 \ell(f^*)$, where
$\mathrm{diag}(f^*)$ is a diagonal matrix with the elements of $f^*$ as its diagonal
entries, and $\nabla^2 \ell(f^*)$ is the second derivative matrix of $\ell(f)$ evaluated
at $f = f^*$. Since $f^*$ is in the interior of $\mathbb{F}$, all entries are positive
and, hence, $\mathrm{diag}(f^*)$ is positive definite. Since $\ell(f)$ is convex on
$\mathbb{F}$, $\nabla^2 \ell(f^*)$ is also positive definite. The claim follows from the
fact that the product of these two positive definite matrices, which is $-J$, must have
positive eigenvalues. □

An interesting observation is that the matrix $P = -J^\top$, the negative transpose of
the Jacobian $J$ in Lemma 5, is a transition probability matrix for an irreducible,
aperiodic Markov chain on $\mathscr{U}$. This chain is also reversible and has $f^*$ as
its stationary distribution. But how this observation might be useful in studying the
asymptotic convergence of PR remains unclear.
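The Markov chain remark can be checked numerically in a small case. The sketch below assumes a correctly specified Poisson mixture, so that $m = m_{f^*}$ and $f^*$ is known; the support points, the weights `f_star`, and the truncation point `y_max` are all hypothetical choices made for the illustration.

```python
import math


def poisson_pmf(y, rate):
    """Poisson kernel p(y | u) with mean `rate`, computed on the log scale."""
    return math.exp(y * math.log(rate) - rate - math.lgamma(y + 1))


support = [1.0, 5.0, 12.0]   # hypothetical support points u
f_star = [0.5, 0.3, 0.2]     # hypothetical interior mixing weights f*
y_max = 100                  # truncation of the Poisson sample space
s = len(support)

# Correctly specified case: the true density m equals m_{f*}.
m = [sum(poisson_pmf(y, u) * fu for u, fu in zip(support, f_star))
     for y in range(y_max)]


def gram(u, v):
    """sum_y p(y|u) p(y|v) m(y) / m_{f*}(y)^2, which is sum_y p p / m here."""
    return sum(poisson_pmf(y, u) * poisson_pmf(y, v) / m[y] for y in range(y_max))


# P = -J^T, so P[v][u] = -J(u, v) = f*(u) * gram(u, v).
P = [[f_star[j] * gram(support[j], support[i]) for j in range(s)]
     for i in range(s)]

row_sums = [sum(row) for row in P]
# Detailed balance (reversibility): f*(v) P(v, u) == f*(u) P(u, v).
balance_gap = max(abs(f_star[i] * P[i][j] - f_star[j] * P[j][i])
                  for i in range(s) for j in range(s))
# Stationarity: (f* P)(u) == f*(u).
stationary = [sum(f_star[i] * P[i][j] for i in range(s)) for j in range(s)]
```

In this example every entry of `P` is strictly positive, which gives irreducibility and aperiodicity directly.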

In light of Assumptions 1-4, Lemmas 4 and 5, and the existence of a Lyapunov function
proved in Lemma 3, the main result on the convergence rate of PR is a consequence of
Chen's theorem in Appendix A.

Theorem 2. Assume that $f^*$ lies in the interior of $\mathbb{F}$. Then under
Assumptions 1-4, $\|f_n - f^*\| = o(w_n^{\delta})$ almost surely for $\delta$ in Lemma 4.

When the weights are given by $w_n = (n + 1)^{-\gamma}$, for $\gamma \in (1/2, 1)$, it
follows from Theorem 2 and the previous discussion that
$\|f_n - f^*\| = o(n^{-(1 - 1/(2\gamma))})$ almost surely. Since $\gamma$ can be chosen
arbitrarily close to 1, it follows that the convergence rate can be made arbitrarily
close to $n^{-1/2}$ almost surely.

A slightly stronger version of Theorem 2 could be obtained if weight sequences were
allowed to satisfy $w_{n+1}^{-1} - w_n^{-1} \to \alpha$, with $\alpha > 0$. For example,
if $w_n = (n + 1)^{-1}$, then $\alpha = 1$. This extension would make the root-n rate
possible, but it would require all eigenvalues of $J$ in Lemma 5 to be less than $-1/2$.
At this point it is unclear whether this claim is true; standard bounds for eigenvalues,
such as those in Gershgorin's theorem or Proposition 2 in Diaconis and Stroock (1991),
are not helpful in this case.

Almost sure rates of convergence for the mixture density $m_n$ to $m_{f^*}$ are
available as consequences of Theorem 2. The $L_1$ rate follows immediately from its
definition, while the rate for the Kullback-Leibler contrast, $K(m, m_n) - K^*$,
requires a simple second-order Taylor approximation of $\ell(f)$ at $f = f^*$.

Corollary 1. Under the conditions of Theorem 2, $\int |m_n - m_{f^*}| \, dy = o(w_n^{\delta})$
almost surely for $\delta$ in Lemma 4. Likewise, $K(m, m_n) - K^* = o(w_n^{2\delta})$.
Martin and Tokdar (2009) derive a bound of $o(W_n^{-1})$ for $K(m, m_n) - K^*$ in the
general compact $\mathscr{U}$ case, where $W_n = \sum_{i=1}^{n} w_i$. When
$w_n = (n + 1)^{-\gamma}$, the bound for $K(m, m_n) - K^*$ in Martin and Tokdar (2009)
becomes $o(n^{-(1 - \gamma)})$, which can be no faster than $n^{-1/3}$ under their
conditions. Compare this to the rate of $o(n^{-(2 - 1/\gamma)})$ obtained from
Corollary 1, which is considerably faster than $n^{-1/3}$ for $\gamma \approx 1$, albeit
for the special known finite support case. So, regarding the PR weights
$\{w_i : i \geq 1\}$, the message here, contrary to that in Martin and Tokdar (2009), is
that the faster the weights vanish the faster the overall convergence.



4 PR with unknown support 

The PR convergence theory in the previous section assumes the finite support is known
and only the mixing distribution is unknown. In practice, however, both the support
and mixing distribution are unknown and to be estimated. To close this gap, I propose
here a new PR-based approach for handling the unknown support case. The asymptotic
results in Section 3 will be used to prove consistency of this new procedure. Two simple
examples are also given for illustration, but the computational details, simulations,
and extensions will be presented elsewhere (Martin 2011).



4.1 Setup 

Let $\overline{\mathscr{U}}$ be a compact set, large enough that there is a finite
mixture supported in $\overline{\mathscr{U}}$ that gives a sufficiently accurate
approximation to $m$. Take $U$ to be a generic finite subset of $\overline{\mathscr{U}}$.
By treating $U$ as the fixed support, a run of PR will produce a sequence of estimates
$\{(f_{i,U}, m_{i,U}) : i = 1, \ldots, n\}$ of the mixing and mixture distributions,
whose dependence on the chosen support set $U$ is now made explicit. In the same vein,
write $\mathbb{F}_U$ for the $(|U| - 1)$-dimensional probability simplex and define
$K^*(U) = \inf\{K(m, m_{f,U}) : f \in \mathbb{F}_U\}$, the smallest Kullback-Leibler
number for mixtures supported on $U$.

The jumping off point is that the result $K(m, m_{n,U}) - K^*(U) = o(w_n^{2\delta})$ of
Corollary 1 holds "pointwise" for all $U$; that is, the particular support $U$ plays no
role in the analysis of Section 3. Thus, in the present case where the support is
unknown, a reasonable strategy is to estimate the support by minimizing, over $U$, some
estimate of $K(m, m_{n,U})$. This is the approach advocated by Martin and Tokdar
(2011b). Indeed, by making connections to PR and Dirichlet process mixture models, they
argue that, in the present context, the appropriate estimate of $K(m, m_{n,U})$ is

$$K_n(U) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{m(Y_i)}{m_{i-1,U}(Y_i)}, \quad U \subset \overline{\mathscr{U}}, \; |U| < \infty. \tag{7}$$

Then the goal is to minimize $K_n(U)$ over $U$. But since it is not possible to perform
this optimization over all finite $U \subset \overline{\mathscr{U}}$, some adjustment
must be made. Consider starting with a fixed finite subset $\mathscr{U}$ of
$\overline{\mathscr{U}}$ obtained by chopping up $\overline{\mathscr{U}}$ into a
sufficiently fine grid, so that $|\mathscr{U}|$ is large. Then the collection of all
subsets $U$ of $\mathscr{U}$ is huge (it has $2^{|\mathscr{U}|} - 1$ elements) but
finite, so it is possible to minimize $K_n(U)$ over $U \subset \mathscr{U}$. Martin
(2011) uses a simulated annealing strategy to perform this optimization. Once the
minimizer $\hat{U}_n$ of $K_n(U)$ is obtained, PR is run once more to produce
$f_{n,\hat{U}_n}$ and $m_{n,\hat{U}_n}$ as estimates of the mixing and mixture
distributions, respectively.
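To make the procedure concrete, here is a minimal sketch of the support search on a deliberately tiny grid. It assumes the objective is the running average of $\log\{m(Y_i)/m_{i-1,U}(Y_i)\}$; since the $\log m(Y_i)$ terms do not depend on $U$, only the $U$-dependent part is computed. Exhaustive enumeration is swapped in for the simulated annealing of Martin (2011), which is feasible only because the grid has five points. The kernel, data, grid, and function names are assumptions made for the illustration.

```python
import math
import random
from itertools import combinations


def normal_pdf(y, u, sd=1.0):
    """Gaussian kernel p(y | u) with mean u and standard deviation sd."""
    z = (y - u) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def pr_objective(data, support, gamma=0.67):
    """Return -(1/n) * sum_i log m_{i-1,U}(Y_i) for U = support.

    This differs from K_n(U) only by a term that is constant in U,
    so it has the same minimizer over U.
    """
    f = [1.0 / len(support)] * len(support)
    total = 0.0
    for i, y in enumerate(data, start=1):
        w = (i + 1) ** (-gamma)
        lik = [normal_pdf(y, u) for u in support]
        m_prev = sum(l * fu for l, fu in zip(lik, f))   # m_{i-1,U}(Y_i)
        total -= math.log(m_prev)
        f = [(1.0 - w) * fu + w * l * fu / m_prev for l, fu in zip(lik, f)]
    return total / len(data)


def best_support(data, grid):
    """Exhaustive search over all non-empty subsets of a small grid."""
    candidates = (list(c) for k in range(1, len(grid) + 1)
                  for c in combinations(grid, k))
    return min(candidates, key=lambda U: pr_objective(data, U))


# Toy data: equal mixture of N(-3, 1) and N(3, 1); the grid is hypothetical.
rng = random.Random(1)
data = [rng.gauss(rng.choice([-3.0, 3.0]), 1.0) for _ in range(500)]
grid = [-3.0, -1.5, 0.0, 1.5, 3.0]
U_hat = best_support(data, grid)
```

The number of candidate supports doubles with each grid point, which is why a stochastic search is needed for grids of realistic size.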

4.2 Large-sample theory 

For simplicity, I will assume that the true density $m$ is indeed a mixture density of
the postulated form with support contained in $\mathscr{U}$; the more general case can
be handled similarly, but with an additional technical assumption (Martin and Tokdar
2011b, Assumption 6). Also, assume that $w_n = (n + 1)^{-\gamma}$ for some
$\gamma \in (0.5, 1)$. To get convergence of the approximation $K_n(U)$ to $K^*(U)$, I
will need one additional assumption, stated next, which holds for many common kernels,
including normal and Poisson.

Assumption 5. There exists a finite constant $A > 0$ such that



max 



Under Assumptions 1-5, one can follow the proof of Theorem 2 in Martin and Tokdar
(2011b) to conclude that, for each fixed $U \subset \mathscr{U}$,

$$\lim_{n \to \infty} \Bigl| c_n \{K_n(U) - K^*(U)\} - \frac{c_n}{n} \sum_{i=1}^{n} \{K(m, m_{i-1,U}) - K^*(U)\} \Bigr| = 0 \tag{8}$$

almost surely, for any sequence $c_n$ that satisfies $c_n = O(n^{1/2 - \varepsilon})$
for some $\varepsilon > 0$. It follows from Corollary 1 that the summation in (8) is of
the order $n^{1/\gamma - 1}$. So, if $\varepsilon > \max\{0, \gamma^{-1} - 3/2\}$, the
right-most term in the modulus in (8) vanishes and, therefore, so must the left-most
term. This proves that, for $\gamma \approx 1$, $K_n(U) \to K^*(U)$ pointwise in $U$ at
a rate just slower than $n^{-1/2}$. But since $2^{\mathscr{U}}$ is finite, the
convergence is also uniform. The following theorem summarizes this result.

Theorem 3. Choose weights $w_n = (n + 1)^{-\gamma}$ with $\gamma \in (0.5, 1)$ and let
$\varepsilon > \max\{0, \gamma^{-1} - 3/2\}$. Then, under Assumptions 1-5,
$n^{1/2 - \varepsilon} \{K_n(U) - K^*(U)\} \to 0$ almost surely as $n \to \infty$.
Moreover, since $U$ ranges only over a finite set,
$n^{1/2 - \varepsilon} K_n(\hat{U}_n) \to 0 = K^*(U^*)$, where $U^* \subset \mathscr{U}$
is the support of the true mixture distribution.
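The condition on $\varepsilon$ is just exponent bookkeeping. The display below is a sketch of that arithmetic, taking the order $n^{1/\gamma - 1}$ of the summation and the bound $c_n = O(n^{1/2 - \varepsilon})$ as given in the text; the summation index $m_{i-1,U}$ is an assumption about the form of the averaged term.

```latex
\[
\frac{c_n}{n} \sum_{i=1}^{n} \{K(m, m_{i-1,U}) - K^*(U)\}
  = O\bigl(n^{1/2 - \varepsilon}\bigr) \cdot n^{-1} \cdot O\bigl(n^{1/\gamma - 1}\bigr)
  = O\bigl(n^{1/\gamma - 3/2 - \varepsilon}\bigr),
\]
```

which vanishes when $\varepsilon > \gamma^{-1} - 3/2$. For $\gamma = 0.8$ the threshold is $-0.25$, so any $\varepsilon > 0$ works; for $\gamma = 0.6$ it is $1/6$. The threshold is positive exactly when $\gamma < 2/3$.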

If I define a distance $d$ between two sets as the cardinality of their symmetric
difference, then Theorem 3 states that $d(\hat{U}_n, U^*) = o(n^{-1/2 + \varepsilon})$.
In other words, $\hat{U}_n$ is a nearly root-n $d$-consistent estimate of $U^*$.
Furthermore, a nearly root-n rate of convergence for $f_{n,\hat{U}_n}$ can be obtained,
which I now sketch. With a slight abuse of notation, I can bound the total variation
distance between $f_{n,\hat{U}_n}$ and $f^*$ as follows:

$$\begin{aligned}
d_{TV}(f_{n,\hat{U}_n}, f^*) &= \sum_{u} |f_{n,\hat{U}_n}(u) - f^*(u)| \\
&= \sum_{u \in \hat{U}_n \setminus U^*} f_{n,\hat{U}_n}(u) + \sum_{u \in U^* \setminus \hat{U}_n} f^*(u) + \sum_{u \in \hat{U}_n \cap U^*} |f_{n,\hat{U}_n}(u) - f^*(u)| \\
&\leq d(\hat{U}_n, U^*) + d_{TV}(f_{n,\hat{U}_n}, f_{n,U^*}) + d_{TV}(f_{n,U^*}, f^*).
\end{aligned}$$

The two outer-most terms on the right-hand side vanish at a nearly root-n rate according
to Theorems 3 and 2, respectively. The middle term is more difficult to analyze, but it
is clear that the data-dependent PR mapping $U \mapsto f_{n,U}$ is, in some sense,
continuous in $U$. So, the convergence of $d_{TV}(f_{n,\hat{U}_n}, f_{n,U^*})$ is also
driven by $d(\hat{U}_n, U^*)$. Therefore, the rate for $d_{TV}(f_{n,\hat{U}_n}, f^*)$
must also be nearly $n^{-1/2}$.

Recall that Chen (1995) showed that, for finite mixtures, the optimal rate of
convergence is $n^{-1/4}$. In that case, the unknown finite support is allowed to be
anything, essentially nonparametric, so the rates are relatively slow. In contrast, by
restricting the set of candidate supports to subsets of a large but ultimately finite
set $\mathscr{U}$, I am able to achieve a nearly parametric root-n rate of convergence.



4.3 Examples 

Here I give two relatively simple real-data examples — a Gaussian location mixture and a 
Poisson mixture — to illustrate the potential of the proposed method. 

Example 1. Under the Big Bang model, galaxies should form clusters and the relative
velocities of the galaxies should be similar within clusters. Roeder (1990) considers
velocity data for $n = 82$ galaxies. She models this data as a finite Gaussian mixture,
with the number and location of the mixture components unknown. The assumption is
that each galactic cluster is a single component of the Gaussian mixture. The presence
of multiple mixture components is consistent with the hypothesis of galaxy clustering.

We apply the methodology outlined above to estimate the mixing distribution $f$.
We will consider a simple Gaussian mixture model in which each component has variance
$\sigma^2 = 1$, based on the a priori considerations of Escobar and West (1995). From
the observed velocities, it is apparent that the mixture components should be centered
somewhere in the interval $\overline{\mathscr{U}} = [5, 40]$, so we choose a grid of
candidate support points $\mathscr{U} = \{5.0, 5.5, 6.0, \ldots, 39.5, 40.0\}$. Figure 1
shows the corresponding estimates of the mixing and mixture distribution. The PR method
identifies six galaxy clusters, and the estimates of $U$ and $f$ closely match those of
Ishwaran et al. (2001) and others.



Example 2. Karlis and Xekalaki (2001, Table 1) present data on the number of defaulted
installments in a Spanish financial institution. This data has a high number of zero
counts, as well as substantial overdispersion. This suggests a Poisson mixture model,
and here we compare the PR-based estimates to others presented in the literature. The
first three rows of Table 1 show the estimates of $(f, U)$ for three methods in a
zero-inflated Poisson mixture model. These include an estimate based on the AIC
penalty, the SCAD-based penalized likelihood approach of Chen and Khalili (2008), and
a minimum Hellinger distance method for count data (Woo and Sriram 2007). I start by
bounding the support by $\overline{\mathscr{U}} = [0, 30]$ and taking $\mathscr{U}$ to
be a set of 100 equispaced points in $\overline{\mathscr{U}}$. All but the Woo-Sriram
estimates have five support points, including zero. Besides this, we find that the
corresponding estimates are quite similar. An attractive feature of this method is that
no special adjustments are needed for zero-inflation. That is, zero-inflation can be
achieved by simply including zero in the grid $\mathscr{U}$ and letting the data decide
if a mass at zero is appropriate. Fitted values were obtained for each of the four
methods (not shown) and I find that, for small y-values, where the observed counts are
relatively large, the PR-based estimate appears to provide a better overall fit
compared to the others.

                                       Estimates
Method    (u_1, f(u_1))  (u_2, f(u_2))  (u_3, f(u_3))  (u_4, f(u_4))   (u_5, f(u_5))
AIC-BIC   (0, .314)      (.298, .435)   (4.37, .200)   (10.99, .048)   (26.51, .002)
MSCAD     (0, .328)      (.302, .417)   (4.19, .193)   (9.78, .055)    (20.01, .007)
WS        (0, .373)      (.36, .385)    (4.52, .199)   (11.26, .043)
SASA      (0, .328)      (.303, .418)   (4.24, .201)   (10.91, .051)   (27.27, .002)

Table 1: Estimates of $(f, U)$ for the financial data Poisson mixture in Example 2. The
first three rows are taken from Chen and Khalili (2008, Table 10).



Acknowledgments 

The author thanks Professor Surya Tokdar for a number of helpful suggestions, and the
Department of Mathematical Sciences, Indiana University-Purdue University Indianapolis,
for their hospitality when a portion of this work was completed.



A Convergence rates for stochastic approximation 

Consider a stochastic approximation process $\{X_n : n \geq 0\}$ which, for fixed
initial value $X_0 = x_0$, is defined recursively as follows:

$$X_n = X_{n-1} + a_n \varphi(X_{n-1}) + a_n Z_n, \quad n \geq 1.$$

The process is designed so that $X_n \to x^*$ almost surely, where $x^*$ satisfies
$\varphi(x^*) = 0$. We shall assume that $\{X_n\}$ is bounded; otherwise, some
truncation or projection techniques are needed (Chen 2002; Kushner and Yin 2003). The
PR estimates $f_n$ are constrained to the simplex, so they satisfy this boundedness
condition trivially. Next are the main assumptions of the theorem.

A1. The weights $\{a_n\}$ satisfy $a_n > 0$, $a_n \to 0$, $\sum_n a_n = \infty$, and
$a_{n+1}^{-1} - a_n^{-1} \to \alpha$ for some $\alpha \geq 0$.

A2. There exists a Lyapunov function $\ell(x)$ at the equilibrium point $x^*$ of the ODE
$dx_t/dt = \varphi(x_t)$.

A3. $\sum_n a_n^{1-\delta} Z_n < \infty$ almost surely for some $\delta \in (0, 1/2)$.

A4. $\varphi(x)$ is continuously differentiable, and all eigenvalues of
$J + \alpha\delta I$ have negative real parts, where $J = D\varphi(x^*)$.

Chen's Theorem. Under A1-A4, $\|X_n - x^*\| = o(a_n^{\delta})$ almost surely.
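A one-dimensional toy problem shows the recursion in action. The sketch below is only an illustration of the Robbins-Monro form above, not a verification of the theorem: the mean field $\varphi(x) = -(x - x^*)$, the uniform noise, the step exponent, and the function name are all assumptions. Here $J = -1$, and $\ell(x) = (x - x^*)^2/2$ serves as a Lyapunov function.

```python
import random


def stochastic_approximation(phi, x0, n_steps, gamma=0.8, noise_scale=1.0, seed=0):
    """Iterate X_n = X_{n-1} + a_n * phi(X_{n-1}) + a_n * Z_n, a_n = (n+1)^(-gamma).

    Z_n is i.i.d. Uniform(-noise_scale, noise_scale): bounded with mean zero,
    hence a martingale difference sequence.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_steps + 1):
        a_n = (n + 1) ** (-gamma)
        z_n = rng.uniform(-noise_scale, noise_scale)
        x = x + a_n * phi(x) + a_n * z_n
    return x


x_star = 2.0


def phi(x):
    """Mean field with a single root at x_star and derivative -1 there."""
    return -(x - x_star)


x_n = stochastic_approximation(phi, x0=10.0, n_steps=20000)
error = abs(x_n - x_star)
```

With these step sizes $a_{n+1}^{-1} - a_n^{-1} \to 0$, so A1 holds with $\alpha = 0$, matching the setting of Assumption 4.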